
Load and incorporate observations in pclean main #120

Merged
merged 17 commits into master on Aug 8, 2024

Conversation

ThomasColthurst (Collaborator)

Doesn't actually do anything unless the T_Schema contains relations for the observation file's column names. (#119 creates such relations.)

cxx/pclean/pclean.cc (outdated)
@ThomasColthurst (Collaborator, Author)

Ping.

T_observations obs;

for (const auto& col : df.data) {
const std::string& col_name = col.first;
Collaborator

We generally haven't been using const on local variables. I don't have a strong opinion either way, but I think we should be consistent, using it everywhere, nowhere, or in some places according to rules we agree on (and up to now we've defaulted to "nowhere"). WDYT?

The Google style guide says "Using const on local variables is neither encouraged nor discouraged."

Collaborator Author

I don't care much about this, but I do see some other uses of this in the code base, like on lines 267, 306 and 423 of util_io.cc.

And we use const all the time on for-loop variables, and those are local variables too!

I guess my overall opinion is that this might be an area where some inconsistency is fine.

@emilyfertig (Collaborator), Aug 5, 2024

Inconsistency is fine by me too, I guess I'd just prefer that it feel less arbitrary. Are there any loose guidelines you'd propose?

Edit: I meant "random" not "arbitrary," some arbitrariness is probably inevitable.

Collaborator Author

I guess my suggestion would be to lean towards using const when you can and when it would help.

So for example, if a local variable is only alive for a few lines and it is clear what is being done to it, the const doesn't help very much. But for a medium-sized or longer function, annotating something as const can give the reader some valuable info about what is going on.

I guess the other guideline I would suggest is that consistency at the function level is more important than any sort of global consistency. That is: if my function has three variables and I mark two of them as const, then there is a slight presumption that the third, non-const variable gets mutated somewhere. So don't leave a variable non-const unless it actually is mutated.
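To illustrate the guideline above, here is a hypothetical sketch (function and variable names are invented, not from the codebase): const pays off where a local lives long enough that the reader benefits from knowing it never changes, while a short-lived local can go without.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// In a longer function, const documents that these values never change
// after initialization, which helps the reader track state.
double mean_column(const std::vector<double>& col) {
  const size_t n = col.size();  // read again many lines below
  const double total = std::accumulate(col.begin(), col.end(), 0.0);
  if (n == 0) {
    return 0.0;
  }
  return total / static_cast<double>(n);
}

// In a short function, skipping const on locals costs little: the
// variable's entire lifetime is visible at a glance.
std::string first_word(const std::string& s) {
  size_t pos = s.find(' ');  // npos if no space; substr then returns all
  return s.substr(0, pos);
}
```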

cxx/pclean/pclean.cc (outdated)

std::vector<std::string> entities;
for (size_t j = 0; j < num_domains; ++j) {
// Assume that each row of the dataframe is its own entity, *and*
Collaborator

I don't understand this block of code -- it looks like we're assuming that each entity in each domain has the same (sequential) value for each observation, so the table looks like:

```
D1 D2 D3 val
 0  0  0   x
 1  1  1   y
 2  2  2   z
...
```

is that right? For Model 5, don't we want to read these in from the observations?

Also, related to the comment, I think we do want to assume that each row of the observation dataframe is its own entity (and is indexed by a primary key domain, like #138 describes), but we don't necessarily want to assume that all ancestor entities are distinct from those of any other entity (adding the index domain should let us avoid that).

Collaborator Author

It is true that in Model 5 we need to assume which entities exist and what their relationships are. I guess we could create a file with that information, but given that we never know that information "in real life", I think it is better to generate it programmatically. When we get to Model 7, we will need to generate the entity assignments programmatically anyway (and then allow them to make transitions).

Just to make sure I'm being clear, let me give a concrete example: the assets/rents_dirty.csv input file, which looks like

```
Column1,Room Type,Monthly Rent,County,State
0,studio,486.0,Mahoning County,OH
1,4br,2152.0,Clark County,NV
2,1br,1267.0,Gwinnett County,GA
...
```

There are two classes in the model: Obs and County. I agree with you that it is reasonable to assume that each row corresponds to its own Obs entity. (For now! For Model 7, we will want the ability to sometimes say that the model thinks that two rows are duplicates of the same underlying entity; that's one of the cleaning operations we want a PClean-type program to be able to do.) But we aren't given the County entity assignment for each row, and it's not trivial to guess it either, as sometimes either or both of the county name and state fields are missing. Given that, I think that it is reasonable to initialize the County entity assignment to be unique for each row (which is what my code currently does).
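The initialization described above (each row starts with its own unique latent entity) can be sketched as follows. This is an illustrative stand-in, not the PR's actual code; the function name and id format are invented:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch: initialize the latent entity assignment for an unobserved
// domain (e.g. County) by giving each row its own singleton entity.
// Inference can later merge rows it decides refer to the same entity.
std::vector<std::string> initial_entity_ids(const std::string& domain,
                                            size_t num_rows) {
  std::vector<std::string> ids;
  ids.reserve(num_rows);
  for (size_t row = 0; row < num_rows; ++row) {
    // Row i starts in its own cluster for this domain.
    ids.push_back(domain + ":" + std::to_string(row));
  }
  return ids;
}
```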

Collaborator

I think for C*/the "Record" class, we always want to assume that each row is a separate entity and is in one-to-one correspondence with rows of the observed table. A clearer example might be the schema you've defined in other tests, which has as C*

class Record
  physician ~ Physician
  practice ~ Practice

The entities of Record are in one-to-one correspondence with the observations, but the physicians and practices they refer to are duplicated.

For testing Model 5, I think (though I'm not sure) that we want to assume we know the entities and their relationships "in real life." This seems important to be able to test that entity clustering is as we expect (we could also hand-build HIRM-like schemas/observations that replicate PClean-style databases, but it would be nice if we could just define PClean-like ones directly). We should probably clarify this with the MIT folks -- I'll post a slack message.

For sampling, eventually we'll want to sample from a County CRP instead of assuming they're unique, using something like the HIRM sampling method introduced in the 080124-emilyaf-sample-hirm branch (WIP).

Collaborator Author

I've posted about it on Slack.

My preference would be to check this code in as-is (or possibly with a TODO or issue to change it depending on the result of the slack conversation). What do you think?

cxx/pclean/pclean.cc (outdated)
{"County",
T_noisy_relation{{"County", "Obs"}, false, EmissionSpec("bigram"), "County:name"}},
{"State",
T_noisy_relation{{"County", "Obs"}, false, EmissionSpec("bigram"), "County:state"}}};
Collaborator

For some of the T_noisy_relations, is_observed should be true, right?

Collaborator Author

translate_observations doesn't use that field of the schema at all, so I don't think it matters at all for the purposes of this test.

Collaborator

I think it matters a little for self-documentation, so I'd prefer to set it to true just for readability/consistency. In the HIRM data-ingesting code, these are set to true after observations of this relation are encountered.

Collaborator Author

Done.

@emilyfertig (Collaborator), Aug 7, 2024

Thanks (sorry for the back and forth, there's a lot going on in this model that's subtle but I think we're getting towards a shared understanding).

It looks to me like the HIRM schema in this test is implementing the following PClean-like schema, is that right?

class County
  name ~ bigram
  state ~ stringcat

Obs
  county ~ County
  rent ~ normal
  room ~ stringcat

Is this what schema_helper.make_hirm_schema() would output? If so, it doesn't quite conform to how I understand Model 5 (and I missed this in earlier code reviews). I would expect the HIRM schema from the above PClean schema to look like this:

  T_schema schema = {
    {"Obs:county:name",
      T_clean_relation{{"dCounty"}, false, DistributionSpec("bigram")}},
    {"Obs:county:state",
      T_clean_relation{{"dCounty"}, false, DistributionSpec("stringcat", state_params)}},
    {"Obs:room type",
      T_clean_relation{{"dObs"}, false, DistributionSpec("stringcat", br_params)}},
    {"Obs:monthly rent",
      T_clean_relation{{"dObs"}, false, DistributionSpec("normal")}},
    {"County",
      T_noisy_relation{{"dCounty", "dObs"}, true, EmissionSpec("bigram"), "Obs:county:name"}},
    {"State",
      T_noisy_relation{{"dCounty", "dObs"}, true, EmissionSpec("bigram"), "Obs:county:state"}},
    {"Room Type",
      T_noisy_relation{{"dObs"}, true, EmissionSpec(...), "Obs:room type"}},
    {"Monthly Rent",
      T_noisy_relation{{"dObs"}, true, EmissionSpec(...), "Obs:monthly rent"}},
};

I think this was the source of a lot of my earlier confusion. According to my reading of Model 5, the Obs/Record class should be treated like all of the other latent classes in terms of defining the clean relations. What's special about it is that there's a one-to-one correspondence between its entities and rows of the observed data. The columns of the observed data are all represented by noisy relations, and all of them have "Obs" as an input domain. If you look at the paragraph on page 10 of the Overleaf that begins "Given all this information", I think this is what it implies (in particular, we need an $I_{r_{C.a}}$ for $C = C*$ and the current code omits that).

If you agree, could you make that change in a follow-up PR?

Collaborator

Actually, thinking about this more, I think my read is wrong and you're right, that the observations that come directly from $C*$ are clean. No changes are needed here or to make_hirm_schema(), and I'm much less confused now.

Collaborator Author

It's hard to say for sure, since the schema you give doesn't parse. I think you might mean

```
class County
  name ~ string
  state ~ stringcat(strings="...")

class Obs
  county ~ County
  rent ~ real
  room ~ stringcat(strings="...")

observe
  county.name as County
  county.state as State
  room as "Room Type"
  rent as "Monthly Rent"
from Obs
```

But anyway, the schema used in pclean_lib_test is not quite what make_hirm_schema would output. Sorry if that caused confusion, but I was optimizing for something simple that would exercise translate_observations.

From the above schema, make_hirm_schema would output four clean_relations named "County:name", "County:state", "Obs:rent" and "Obs:room", and four noisy_relations named "County", "State", "Room Type" and "Monthly Rent". Other than the slight name differences, they are basically the same as you have in your expected output.

Collaborator

Sorry for writing a non-parsable schema, I just wanted to verify the class definitions, and you're right, I was assuming there was an observation for each attribute.

Thanks for clarifying what make_hirm_schema would output. The Model 5 section of the overleaf seems ambiguous as to whether we should have noisy observations for attributes in $C*$ (as make_hirm_schema implements) or whether $C*$ is assumed to contain clean observations (as in the test you wrote). We should ask them to clarify (and fix it in a follow-up if need be, not this PR).

T_clean_relation{{"Obs"}, false, DistributionSpec("stringcat", br_params)}},
{"Monthly Rent",
T_clean_relation{{"Obs"}, false, DistributionSpec("normal")}},
{"County",
Collaborator

For clarity, could you give the "County" and "State" relations names that disambiguate them from the domains? These names would be output from the schema converter as "Obs:county:name" and "Obs:state:name", right?

Collaborator Author

I've prepended a "d" to all of the domain names for clarity.

The schema converter would give these relation names of "County" and "State", actually -- for noisy relations coming from an "observe x as y" line, the relation name is always taken as the "y", so that it can match the csv column name.
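The naming rule described above (for an "observe x as y" line, the noisy relation takes the name "y" so it matches the CSV column header) could be sketched like this. The function name is invented for illustration and is not the actual converter code:

```cpp
#include <cassert>
#include <string>

// Sketch: extract the relation name from an "x as y" observe line,
// where y may be quoted, e.g. room as "Room Type". Assumes a single
// " as " separator; real parsing in the codebase may differ.
std::string relation_name_from_observe(const std::string& line) {
  const std::string kAs = " as ";
  const size_t pos = line.find(kAs);
  if (pos == std::string::npos) {
    return line;  // malformed line: leave unchanged in this sketch
  }
  std::string name = line.substr(pos + kAs.size());
  // Strip optional surrounding quotes so the name matches the raw
  // CSV column header, spaces and all.
  if (name.size() >= 2 && name.front() == '"' && name.back() == '"') {
    name = name.substr(1, name.size() - 2);
  }
  return name;
}
```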


std::vector<std::string> entities;
for (size_t j = 0; j < num_domains; ++j) {
// Give every row its own universe of unique ids.
@emilyfertig (Collaborator), Aug 6, 2024

Now that I think about this more, I don't think we should initialize unobserved "observations" here. Either we should sample them from the generative model once we have an HIRM instance (which is what we'll do for Model 7 anyway), or for the purposes of Model 5/6 we should read them in from a file (depending on whether we decide that's necessary with the MIT folks -- sorry I haven't had a chance to post on Slack yet).

For the purposes of this PR, I think we can just omit unobserved values from the observations (i.e. assume only relations with is_observed == true have observations).

Collaborator Author

You are using the word "unobserved" in a different way than I would use it.

For me, the schema creates the relations, and then the data comes in, and after all the data is in, we can mark a relation as observed or unobserved depending on whether it had any observations.

If I had to guess, you are maybe using "unobserved" to mean something like "a relation for which we will need to initialize and maintain a latent state"?

But just to be clear, every observation here is for a noisy relation generated in the second part of PCleanSchemaHelper::make_hirm_schema -- those are the only ones that correspond to CSV column names. None of those noisy relations will need to have a hidden latent state. Even if somehow all of their data turned out to be missing from the CSV file, there are no other relations that depend on them.

@emilyfertig (Collaborator), Aug 7, 2024

> For me, the schema creates the relations, and then the data comes in, and after all the data is in, we can mark a relation as observed or unobserved depending on whether it had any observations.

Ok, I think I see. The observed relations should be exactly those defined by the observe statement, so we can declare a relation to be "observed" if it appears in that statement, right? I wouldn't expect there to be data for relations not defined by observe, and if a relation defined by observe doesn't have any observations in the data, I think that's a user error. Do you agree?

> If I had to guess, you are maybe using "unobserved" to mean something like "a relation for which we will need to initialize and maintain a latent state"?

Assuming the observed relations are those defined by the observe statement (and those which have observed data) I think these are equivalent -- everything that isn't observed (which should be the base relations of noisy relations), we have to infer.

Collaborator Author

> For me, the schema creates the relations, and then the data comes in, and after all the data is in, we can mark a relation as observed or unobserved depending on whether it had any observations.

> Ok, I think I see. The observed relations should be exactly those defined by the observe statement, so we can declare a relation to be "observed" if it appears in that statement, right?

Well, again, I think the best/safest/most intuitive thing to do is to call a relation observed iff we see data for it. So in line with that, I would mark relations as is_observed inside the incorporate_observations function.

But since we don't do that currently (is_observed is marked in load_observations), I've done the next best thing and declared the relations created from the "observe" clause in the schema as is_observed in this pull request.
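The "observed iff we see data" policy described above could look roughly like this. The struct and function names here are invented stand-ins, not the codebase's actual types:

```cpp
#include <cassert>
#include <map>
#include <string>

// Minimal stand-in for a schema entry's observed flag.
struct RelationInfo {
  bool is_observed = false;
};

// Sketch: mark a relation observed at incorporate time, so a relation
// ends up is_observed iff at least one observation actually arrived
// for it (e.g. its CSV column wasn't dropped).
void incorporate_observation(std::map<std::string, RelationInfo>& schema,
                             const std::string& relation_name) {
  // ... incorporate the value into the relation (elided) ...
  schema[relation_name].is_observed = true;
}
```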

> I wouldn't expect there to be data for relations not defined by observe, and if a relation defined by observe doesn't have any observations in the data, I think that's a user error. Do you agree?

Not quite. If I have a schema and a csv file, and then I drop a column from the csv file, I wouldn't call it user error to run pclean on the combination, even though there would be an observe relation without any observations. I think it would be quite easy to support that use case, so we probably should (while perhaps agreeing that supporting it isn't the most important priority for a research codebase).

> If I had to guess, you are maybe using "unobserved" to mean something like "a relation for which we will need to initialize and maintain a latent state"?

> Assuming the observed relations are those defined by the observe statement (and those which have observed data) I think these are equivalent -- everything that isn't observed (which should be the base relations of noisy relations), we have to infer.

But just to be clear, every observation here is for a noisy relation generated in the second part of PCleanSchemaHelper::make_hirm_schema -- those are the only ones that correspond to CSV column names. None of those noisy relations will need to have a hidden latent state. Even if somehow all of their data turned out to be missing from the CSV file, there are no other relations that depend on them.

Collaborator Author

I would also add that this conversation now has almost nothing to do with this pull request, and should probably be moved to a different channel (like a meeting or github issue).

Collaborator

> But just to be clear, every observation here is for a noisy relation generated in the second part of PCleanSchemaHelper::make_hirm_schema -- those are the only ones that correspond to CSV column names. None of those noisy relations will need to have a hidden latent state. Even if somehow all of their data turned out to be missing from the CSV file, there are no other relations that depend on them.

Could you add a TODO to sample the non-index domains from a CRP prior instead of assuming each entry is a unique entity? (We might want to rethink this more substantially for Model 7, i.e. whether it makes sense to initialize the entities as part of the data ingestion process or elsewhere. Other inferred values, e.g. latent values of a noisy relation, are initialized during incorporate, which might make more sense).

Collaborator Author

Added a TODO to discuss and consider other options.

Collaborator

> I would also add that this conversation now has almost nothing to do with this pull request, and should probably be moved to a different channel (like a meeting or github issue).

For me, this conversation was very relevant, and essential in allowing me to verify that this PR does the right thing, so I appreciate you taking the time to clarify.

Collaborator

> But since we don't do that currently (is_observed is marked in load_observations), I've done the next best thing and declared the relations created from the "observe" clause in the schema as is_observed in this pull request.

I was going to file a Github issue, but I think we should leave this as-is. Downstream in HIRM !is_observed implies the value is latent and needs to be inferred, so setting is_observed according to the "observe" clause gives the right semantics.

> I wouldn't expect there to be data for relations not defined by observe, and if a relation defined by observe doesn't have any observations in the data, I think that's a user error. Do you agree?

> Not quite. If I have a schema and a csv file, and then I drop a column from the csv file, I wouldn't call it user error to run pclean on the combination, even though there would be an observe relation without any observations. I think it would be quite easy to support that use case, so we probably should (while perhaps agreeing that supporting it isn't the most important priority for a research codebase).

That's fine, I was thinking it made sense to support a more limited notion of valid inputs before potentially supporting something more general, but I agree it doesn't matter much here.

@emilyfertig (Collaborator) left a comment

This LGTM pending a few last clarifications in the comments.

@emilyfertig (Collaborator) left a comment

Thanks! In retrospect some of this might have been better pre-discussed in a design doc, but I think it got to a good place.

@ThomasColthurst ThomasColthurst merged commit 62c13a2 into master Aug 8, 2024
2 checks passed
@ThomasColthurst ThomasColthurst deleted the 073124-thomaswc-incobs branch August 8, 2024 13:38